Overview

Dataset statistics

Number of variables: 12
Number of observations: 683
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 64.2 KiB
Average record size in memory: 96.2 B

Variable types

NUM: 11
CAT: 1

Reproduction

Analysis started: 2020-07-03 14:08:18.336084
Analysis finished: 2020-07-03 14:08:39.418573
Duration: 21.08 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

UniformityOfCellShape is highly correlated with UniformityOfCellSize (High correlation)
UniformityOfCellSize is highly correlated with UniformityOfCellShape (High correlation)
df_index has unique values (Unique)

Variables

df_index
Real number (ℝ≥0)

UNIQUE

Distinct count: 683
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 351.35578330893117
Minimum: 0
Maximum: 698
Zeros: 1
Zeros (%): 0.1%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 0
5-th percentile: 35.1
Q1: 176.5
median: 355
Q3: 526.5
95-th percentile: 663.9
Maximum: 698
Range: 698
Interquartile range (IQR): 350

Descriptive statistics

Standard deviation: 202.5639269
Coefficient of variation (CV): 0.5765208275
Kurtosis: -1.208617879
Mean: 351.3557833
Median Absolute Deviation (MAD): 175
Skewness: -0.02098897312
Sum: 239976
Variance: 41032.14449

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
698 | 1 | 0.1%
230 | 1 | 0.1%
239 | 1 | 0.1%
238 | 1 | 0.1%
237 | 1 | 0.1%
236 | 1 | 0.1%
234 | 1 | 0.1%
233 | 1 | 0.1%
232 | 1 | 0.1%
231 | 1 | 0.1%
Other values (673) | 673 | 98.5%

Value | Count | Frequency (%)
0 | 1 | 0.1%
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%

Value | Count | Frequency (%)
698 | 1 | 0.1%
697 | 1 | 0.1%
696 | 1 | 0.1%
695 | 1 | 0.1%
694 | 1 | 0.1%
 

ID
Real number (ℝ≥0)

Distinct count: 630
Unique (%): 92.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1076720.2269399706
Minimum: 63375
Maximum: 13454352
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 63375
5-th percentile: 413929.8
Q1: 877617
median: 1171795
Q3: 1238705
95-th percentile: 1334001.2
Maximum: 13454352
Range: 13390977
Interquartile range (IQR): 361088

Descriptive statistics

Standard deviation: 620644.0477
Coefficient of variation (CV): 0.576420905
Kurtosis: 257.3684102
Mean: 1076720.227
Median Absolute Deviation (MAD): 104296
Skewness: 13.74841025
Sum: 735399915
Variance: 3.851990339e+11

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1182404 | 6 | 0.9%
1276091 | 5 | 0.7%
1198641 | 3 | 0.4%
704097 | 2 | 0.3%
1321942 | 2 | 0.3%
695091 | 2 | 0.3%
1017023 | 2 | 0.3%
385103 | 2 | 0.3%
1070935 | 2 | 0.3%
1240603 | 2 | 0.3%
Other values (620) | 655 | 95.9%

Value | Count | Frequency (%)
63375 | 1 | 0.1%
76389 | 1 | 0.1%
95719 | 1 | 0.1%
128059 | 1 | 0.1%
142932 | 1 | 0.1%

Value | Count | Frequency (%)
13454352 | 1 | 0.1%
8233704 | 1 | 0.1%
1371920 | 1 | 0.1%
1371026 | 1 | 0.1%
1369821 | 1 | 0.1%
 

ClumpThick
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.44216691068814
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 4
Q3: 6
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.820761319
Coefficient of variation (CV): 0.6349966977
Kurtosis: -0.6331245309
Mean: 4.442166911
Median Absolute Deviation (MAD): 2
Skewness: 0.5876542361
Sum: 3034
Variance: 7.956694418

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 139 | 20.4%
5 | 128 | 18.7%
3 | 104 | 15.2%
4 | 79 | 11.6%
10 | 69 | 10.1%
2 | 50 | 7.3%
8 | 44 | 6.4%
6 | 33 | 4.8%
7 | 23 | 3.4%
9 | 14 | 2.0%

Value | Count | Frequency (%)
1 | 139 | 20.4%
2 | 50 | 7.3%
3 | 104 | 15.2%
4 | 79 | 11.6%
5 | 128 | 18.7%

Value | Count | Frequency (%)
10 | 69 | 10.1%
9 | 14 | 2.0%
8 | 44 | 6.4%
7 | 23 | 3.4%
6 | 33 | 4.8%
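
Because ClumpThick takes only ten distinct values, its summary statistics can be cross-checked directly from the value counts reported above. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and it assumes the report uses the sample variance (n - 1 denominator) and linearly interpolated quartiles, which is consistent with the figures shown.

```python
import statistics

# Value -> count, copied from the ClumpThick value counts in this report.
clump_thick_counts = {1: 139, 2: 50, 3: 104, 4: 79, 5: 128,
                      6: 33, 7: 23, 8: 44, 9: 14, 10: 69}

# Expand the counts back into one sorted list of observations.
values = sorted(v for v, c in clump_thick_counts.items() for _ in range(c))

n = len(values)
total = sum(values)
mean = total / n
variance = statistics.variance(values)   # sample variance, n - 1 denominator
std = statistics.stdev(values)
cv = std / mean                           # coefficient of variation
# "inclusive" interpolates linearly between the minimum and maximum.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
value_range = max(values) - min(values)

print(f"n={n} sum={total} mean={mean:.9f}")
print(f"variance={variance:.9f} std={std:.9f} cv={cv:.10f}")
print(f"Q1={q1} median={median} Q3={q3} IQR={iqr} range={value_range}")
```

The same approach works for the other 1-to-10 variables, since their frequency tables list every distinct value.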
 

UniformityOfCellSize
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.150805270863836
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 3.065144856
Coefficient of variation (CV): 0.9728131675
Kurtosis: 0.0736791399
Mean: 3.150805271
Median Absolute Deviation (MAD): 0
Skewness: 1.226404096
Sum: 2152
Variance: 9.395112987

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 373 | 54.6%
10 | 67 | 9.8%
3 | 52 | 7.6%
2 | 45 | 6.6%
4 | 38 | 5.6%
5 | 30 | 4.4%
8 | 28 | 4.1%
6 | 25 | 3.7%
7 | 19 | 2.8%
9 | 6 | 0.9%

Value | Count | Frequency (%)
1 | 373 | 54.6%
2 | 45 | 6.6%
3 | 52 | 7.6%
4 | 38 | 5.6%
5 | 30 | 4.4%

Value | Count | Frequency (%)
10 | 67 | 9.8%
9 | 6 | 0.9%
8 | 28 | 4.1%
7 | 19 | 2.8%
6 | 25 | 3.7%
 

UniformityOfCellShape
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.2152269399707176
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.988580818
Coefficient of variation (CV): 0.929508515
Kurtosis: -0.01681562061
Mean: 3.21522694
Median Absolute Deviation (MAD): 0
Skewness: 1.157890012
Sum: 2196
Variance: 8.931615308

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 346 | 50.7%
10 | 58 | 8.5%
2 | 58 | 8.5%
3 | 53 | 7.8%
4 | 43 | 6.3%
5 | 32 | 4.7%
7 | 30 | 4.4%
6 | 29 | 4.2%
8 | 27 | 4.0%
9 | 7 | 1.0%

Value | Count | Frequency (%)
1 | 346 | 50.7%
2 | 58 | 8.5%
3 | 53 | 7.8%
4 | 43 | 6.3%
5 | 32 | 4.7%

Value | Count | Frequency (%)
10 | 58 | 8.5%
9 | 7 | 1.0%
8 | 27 | 4.0%
7 | 30 | 4.4%
6 | 29 | 4.2%
 

MarginalAdhesion
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.830161054172767
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.86456219
Coefficient of variation (CV): 1.012155187
Kurtosis: 0.9424072094
Mean: 2.830161054
Median Absolute Deviation (MAD): 0
Skewness: 1.509181064
Sum: 1933
Variance: 8.205716543

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 393 | 57.5%
3 | 58 | 8.5%
2 | 58 | 8.5%
10 | 55 | 8.1%
4 | 33 | 4.8%
8 | 25 | 3.7%
5 | 23 | 3.4%
6 | 21 | 3.1%
7 | 13 | 1.9%
9 | 4 | 0.6%

Value | Count | Frequency (%)
1 | 393 | 57.5%
2 | 58 | 8.5%
3 | 58 | 8.5%
4 | 33 | 4.8%
5 | 23 | 3.4%

Value | Count | Frequency (%)
10 | 55 | 8.1%
9 | 4 | 0.6%
8 | 25 | 3.7%
7 | 13 | 1.9%
6 | 21 | 3.1%
 

SingleEpithelialCellSize
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.234260614934114
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 2
Q3: 4
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 2.223085456
Coefficient of variation (CV): 0.6873550777
Kurtosis: 2.129639279
Mean: 3.234260615
Median Absolute Deviation (MAD): 0
Skewness: 1.703716401
Sum: 2209
Variance: 4.942108947

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
2 | 376 | 55.1%
3 | 71 | 10.4%
4 | 48 | 7.0%
1 | 44 | 6.4%
6 | 40 | 5.9%
5 | 39 | 5.7%
10 | 31 | 4.5%
8 | 21 | 3.1%
7 | 11 | 1.6%
9 | 2 | 0.3%

Value | Count | Frequency (%)
1 | 44 | 6.4%
2 | 376 | 55.1%
3 | 71 | 10.4%
4 | 48 | 7.0%
5 | 39 | 5.7%

Value | Count | Frequency (%)
10 | 31 | 4.5%
9 | 2 | 0.3%
8 | 21 | 3.1%
7 | 11 | 1.6%
6 | 40 | 5.9%
 

BareNuclei
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.5446559297218156
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 6
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 3.64385716
Coefficient of variation (CV): 1.027986138
Kurtosis: -0.7988441354
Mean: 3.54465593
Median Absolute Deviation (MAD): 0
Skewness: 0.9900156547
Sum: 2421
Variance: 13.27769501

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 402 | 58.9%
10 | 132 | 19.3%
5 | 30 | 4.4%
2 | 30 | 4.4%
3 | 28 | 4.1%
8 | 21 | 3.1%
4 | 19 | 2.8%
9 | 9 | 1.3%
7 | 8 | 1.2%
6 | 4 | 0.6%

Value | Count | Frequency (%)
1 | 402 | 58.9%
2 | 30 | 4.4%
3 | 28 | 4.1%
4 | 19 | 2.8%
5 | 30 | 4.4%

Value | Count | Frequency (%)
10 | 132 | 19.3%
9 | 9 | 1.3%
8 | 21 | 3.1%
7 | 8 | 1.2%
6 | 4 | 0.6%
 

BlandChromatin
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.445095168374817
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 5
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.449696573
Coefficient of variation (CV): 0.7110678959
Kurtosis: 0.1676456428
Mean: 3.445095168
Median Absolute Deviation (MAD): 1
Skewness: 1.095270469
Sum: 2353
Variance: 6.001013297

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
3 | 161 | 23.6%
2 | 160 | 23.4%
1 | 150 | 22.0%
7 | 71 | 10.4%
4 | 39 | 5.7%
5 | 34 | 5.0%
8 | 28 | 4.1%
10 | 20 | 2.9%
9 | 11 | 1.6%
6 | 9 | 1.3%

Value | Count | Frequency (%)
1 | 150 | 22.0%
2 | 160 | 23.4%
3 | 161 | 23.6%
4 | 39 | 5.7%
5 | 34 | 5.0%

Value | Count | Frequency (%)
10 | 20 | 2.9%
9 | 11 | 1.6%
8 | 28 | 4.1%
7 | 71 | 10.4%
6 | 9 | 1.3%
 

NormalNucleoli
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.869692532942899
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.052666407
Coefficient of variation (CV): 1.063760794
Kurtosis: 0.4735882982
Mean: 2.869692533
Median Absolute Deviation (MAD): 0
Skewness: 1.420431124
Sum: 1960
Variance: 9.318772193

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 432 | 63.3%
10 | 60 | 8.8%
3 | 42 | 6.1%
2 | 36 | 5.3%
8 | 23 | 3.4%
6 | 22 | 3.2%
5 | 19 | 2.8%
4 | 18 | 2.6%
7 | 16 | 2.3%
9 | 15 | 2.2%

Value | Count | Frequency (%)
1 | 432 | 63.3%
2 | 36 | 5.3%
3 | 42 | 6.1%
4 | 18 | 2.6%
5 | 19 | 2.8%

Value | Count | Frequency (%)
10 | 60 | 8.8%
9 | 15 | 2.2%
8 | 23 | 3.4%
7 | 16 | 2.3%
6 | 22 | 3.2%
 

Mitoses
Real number (ℝ≥0)

Distinct count: 9
Unique (%): 1.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.603221083455344
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 5
Maximum: 10
Range: 9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.732674146
Coefficient of variation (CV): 1.080745609
Kurtosis: 12.27337364
Mean: 1.603221083
Median Absolute Deviation (MAD): 0
Skewness: 3.511476241
Sum: 1095
Variance: 3.002159697

Histogram with fixed size bins (bins=10)

Value | Count | Frequency (%)
1 | 563 | 82.4%
2 | 35 | 5.1%
3 | 33 | 4.8%
10 | 14 | 2.0%
4 | 12 | 1.8%
7 | 9 | 1.3%
8 | 8 | 1.2%
5 | 6 | 0.9%
6 | 3 | 0.4%

Value | Count | Frequency (%)
1 | 563 | 82.4%
2 | 35 | 5.1%
3 | 33 | 4.8%
4 | 12 | 1.8%
5 | 6 | 0.9%

Value | Count | Frequency (%)
10 | 14 | 2.0%
8 | 8 | 1.2%
7 | 9 | 1.3%
6 | 3 | 0.4%
5 | 6 | 0.9%
 

Class
Categorical

Distinct count: 2
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 5.3 KiB

Value | Count | Frequency (%)
2 | 444 | 65.0%
4 | 239 | 35.0%
 

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, implying that for a linear relationship the steepness of the line does not affect r; only the sign of its slope does.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
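
The calculation just described can be sketched in a few lines of standard-library Python. This is an illustration only, not the code the report itself runs; the function name `pearson_r` and the toy data are assumptions made for the example.

```python
import math

def pearson_r(xs, ys):
    """Covariance of X and Y divided by the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/(n - 1) factors of the covariance and of the two standard
    # deviations cancel, so sums of centred products are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy data: exact linear relations with different slopes.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 3 for v in x]         # increasing line
y_down = [-0.5 * v + 10 for v in x]   # decreasing, much flatter line

print(pearson_r(x, y_up))    # close to +1
print(pearson_r(x, y_down))  # close to -1
```

Both lines give |r| of about 1 despite their different slopes, which illustrates the invariance to location and scale.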

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
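
As a minimal sketch of that recipe, the code below ranks each variable (assuming tied values share the average of their positions, a common convention) and then applies the Pearson formula to the ranks. The function names and the toy data are illustrative assumptions, not taken from the report.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank variables of X and Y."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy data: monotonic but clearly nonlinear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(average_ranks([10, 20, 20, 30]))  # the two tied values share rank 2.5
print(spearman_rho(x, y))
```

Because y increases whenever x does, the ranks coincide and ρ reaches its maximum even though the relation is not linear.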

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
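
The pair-counting definition can be sketched directly, as below. This ratio is often called tau-a; statistical libraries commonly report a tie-adjusted variant (tau-b), so their output can differ from this sketch when the data contain ties. The function name and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs, as described above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    concordant = discordant = 0
    n_pairs = 0
    for i, j in combinations(range(len(xs)), 2):
        n_pairs += 1
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
        # direction == 0 is a tie and counts as neither
    return (concordant - discordant) / n_pairs

# Hypothetical toy data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

print(kendall_tau_a(x, y))  # 5 concordant, 1 discordant, 6 pairs
```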

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for the phik package.

Missing values

Sample

First rows

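 | df_index | ID | ClumpThick | UniformityOfCellSize | UniformityOfCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class
0 | 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
1 | 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2
2 | 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2
3 | 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2
4 | 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2
5 | 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4
6 | 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2
7 | 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2
8 | 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2
9 | 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2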

Last rows

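 | df_index | ID | ClumpThick | UniformityOfCellSize | UniformityOfCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class
673 | 689 | 654546 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 8 | 2
674 | 690 | 654546 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 2
675 | 691 | 695091 | 5 | 10 | 10 | 5 | 4 | 5 | 4 | 4 | 1 | 4
676 | 692 | 714039 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2
677 | 693 | 763235 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 2
678 | 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2
679 | 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2
680 | 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4
681 | 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4
682 | 698 | 897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 4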